Language Identification from Text Using N-gram Based Cumulative Frequency Addition

نویسندگان

Bashir Ahmed

Sung-Hyuk Cha

Charles Tappert

چکیده

This paper describes the preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams. The new classification technique is simpler than the conventional Naïve Bayesian classification method, but it performs similarly in speed overall and better in accuracy on short input strings. The classifier is also 5-10 times faster than N-gram based rank-order statistical classifiers. Language classification using N-gram based rank-order statistics has been shown to be highly accurate and insensitive to typographical errors, and, as a result, this method has been extensively researched and documented in the language processing literature. However, classification using rank-order statistics is slower than other methods due to the inherent requirement of frequency counting and sorting of N-grams in the test document profile. Accuracy and speed of classification are crucial for a classier to be useful in a high volume categorization environment. Thus, it is important to investigate the performance of the N-gram based classification methods. In particular, if it is possible to eliminate the counting and sorting operations in the rank-order statistics methods, classification speed could be increased substantially. The classifier described here accomplishes that goal by using a new Cumulative Frequency Addition method.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Detection of Foreign Entities in Native Text Using N-gram Based Cumulative Frequency Addition

This paper describes a logarithmic version of the conventional Naïve Bayesian N-gram-based, textclassification algorithm that we name Cumulative Frequency Addition (CFA) and its application in three tasks: language identification, nationality identification from names, and detection of foreign words in base text. The new CFA technique is 3-10 times faster than N-gram based rank-order statistica...

متن کامل

Author gender identification from text using Bayesian Random Forest

Nowadays high usage of users from virtual environments and their connection via social networks like Facebook, Instagram, and Twitter shows the necessity of finding out shared subjects in this environment more than before. There are several applications that benefit from reliable methods for inferring age and gender of users in social media. Such applications exist across a wide area of fields,...

متن کامل

Improved language identification using support vector machines for language modeling

Automatic language identification (LID) decisions are made based on scores of language models (LM). In our previous paper [1], we have shown that replacing n-gram LMs with SVMs significantly improved performance of both the PPRLM and GMMtokenization-based LID systems when tested on the OGI-TS corpus. However, the relatively small corpus size may limit the general applicability of the findings. ...

متن کامل

Language identification based on n-gram frequency ranking

We present a novel approach for language identification based on a text categorization technique, namely an n-gram frequency ranking. We use a Parallel phone recognizer, the same as in PPRLM, but instead of the language model, we create a ranking with the most frequent n-grams, keeping only a fraction of them. Then we compute the distance between the input sentence ranking and each language ran...

متن کامل

بازشناسی متون فارسی با استفاده از مدل زبانی n-gram و پالایش گرامری

Abstract Text recognition has been one of the growing research topics in recent years. Many of these researches have focused on recognition of letters and sub-words as a basis for identifying larger text structures such as words, phrases and sentences. This thesis presents a new method in which the recognized sub-words are combined in order to provide meaningful words and sentences in Farsi tex...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2004

Language Identification from Text Using N-gram Based Cumulative Frequency Addition

نویسندگان

چکیده

منابع مشابه

Detection of Foreign Entities in Native Text Using N-gram Based Cumulative Frequency Addition

Author gender identification from text using Bayesian Random Forest

Improved language identification using support vector machines for language modeling

Language identification based on n-gram frequency ranking

بازشناسی متون فارسی با استفاده از مدل زبانی n-gram و پالایش گرامری

عنوان ژورنال:

اشتراک گذاری